Chenzi Xu
MPhil DPhil (Oxon)
University of York
2022-12-04 (updated: 2022-12-12)
The sampling distribution of means is approximately normal, provided that the samples are independent and the sample size is sufficiently large (the Central Limit Theorem).
Illustration from Wikipedia
The Exponential distribution has a single parameter, the rate \(\lambda\).
It represents the range of plausible values of the \(\mu\) parameter.
If you take samples repeatedly and compute the CI each time, 95% of those CIs will contain the true population mean \(\mu\).
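This repeated-sampling interpretation can be checked by simulation. A minimal Python sketch (the function name and all numbers are illustrative, not from the slides): draw many samples from a known population, compute a normal-theory 95% CI for each, and count how often the CI covers the true \(\mu\).

```python
import random, statistics, math

def ci_coverage(mu=0.0, sigma=1.0, n=50, reps=2000, seed=1):
    """Estimate how often a normal-theory 95% CI for the mean covers mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        lo, hi = m - 1.96 * se, m + 1.96 * se
        hits += (lo <= mu <= hi)
    return hits / reps

# The observed coverage should be close to 0.95 (with some simulation noise).
coverage = ci_coverage()
```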
A common misinterpretation: the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
The probability of obtaining the observed sample statistic, or some value more extreme than that, conditional on the assumption that the null hypothesis is true.
Null hypothesis significance testing (NHST) is only meaningful when power is high.
Data: observations produced by an underlying random variable \(Y\)
Examples:
Probability distribution \(p(Y)\):
Here, \(n\) represents the total number of trials, \(k\) the number of successes, and \(\theta\) the probability of success. The term \(\binom{n}{k}\), the number of ways in which one can choose \(k\) successes out of \(n\) trials, expands to \(\frac{n!}{k!(n-k)!}\).
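These quantities are easy to verify numerically. A small Python sketch (the helper name `binom_pmf` is ours, and the numbers are just for illustration):

```python
from math import comb, factorial

def binom_pmf(k, n, theta):
    """p(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# C(10, 7) = 10! / (7! * 3!) = 120 ways to choose 7 successes out of 10 trials.
assert comb(10, 7) == factorial(10) // (factorial(7) * factorial(3)) == 120

# The PMF sums to 1 over all possible k = 0..n.
total = sum(binom_pmf(k, 10, 0.5) for k in range(11))
```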
Assumption: each data point in the data vector \(y\) is independent of the others.
The kernel of the normal PDF: \(g(x|\mu,\sigma)=\exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right)\), with a normalizing constant \(k=\frac{1}{\sigma\sqrt{2\pi}}\) such that \(k\int{g(x)dx}=1\).
Probability in a continuous distribution is the area under the curve \(P(X<u)=\int_{-\infty}^uf(x)dx\), and will always be zero at any point value \(P(X=x)=0\).
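For the normal distribution, this area can be computed with the error function from the standard library. A Python sketch (the helper name `normal_cdf` is ours):

```python
import math

def normal_cdf(u, mu=0.0, sigma=1.0):
    """P(X < u) for X ~ Normal(mu, sigma), via the error function."""
    return 0.5 * (1 + math.erf((u - mu) / (sigma * math.sqrt(2))))

# P(X < mu) = 0.5, and the total area under the curve is 1.
assert abs(normal_cdf(0.0) - 0.5) < 1e-12

# The probability of an interval is a difference of two areas:
p_middle = normal_cdf(1.96) - normal_cdf(-1.96)  # about 0.95
```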
Expectation \(E(X)\): the weighted mean of the possible outcomes, weighted by the respective probabilities of each outcome.
Variance is defined as: \(Var(X) = E[X^2]-(E[X])^2\)
Expectation: \(E\left[X\right]=\sum_x x \cdot f(x)=n\theta\)
Variance: \(Var(X)=n\theta(1-\theta)\)
Expectation: \(E[X]=\int xf(x)dx=\mu\)
Variance: \(Var(X)=\sigma^2\)
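The binomial identities above can be checked by computing the weighted sums directly from the PMF. A Python sketch with arbitrary illustrative values \(n=10\), \(\theta=0.3\):

```python
from math import comb

n, theta = 10, 0.3
pmf = {k: comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)}

# E[X] = sum_x x * f(x): the probability-weighted mean of the outcomes.
E = sum(k * p for k, p in pmf.items())

# Var(X) = E[X^2] - (E[X])^2.
Var = sum(k**2 * p for k, p in pmf.items()) - E**2
```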
The likelihood function \(\mathcal{L}(\theta \mid k,n)\) refers to the PMF \(p(k|n,\theta)\), treated as a function of \(\theta\).
Suppose that we record \(n = 10\) trials, and observe \(k = 7\) successes (heads in coin tosses).
\[\begin{equation} \mathcal{L}(\theta \mid k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]
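Evaluating this likelihood over a grid of \(\theta\) values shows that it peaks at the maximum likelihood estimate \(k/n = 0.7\). A Python sketch:

```python
from math import comb

def likelihood(theta, k=7, n=10):
    """L(theta | k, n): the binomial PMF viewed as a function of theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Evaluate on a grid of theta values in (0, 1); the maximum is the MLE.
grid = [i / 100 for i in range(1, 100)]
mle = max(grid, key=likelihood)
```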
The concept of “integrating out a parameter”
Marginal likelihood: the likelihood computed by “marginalizing out” the parameter \(\theta\). It is a weighted sum (or integral) of the likelihood over the possible values of the parameter, weighted by the prior.
We have a joint PMF \(p_{X,Y} (x, y)\) for each possible pair of values of X and Y.
The marginal distributions: \[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y) \end{equation}\]
\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y) \end{equation}\]
The conditional distributions: \[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]
\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]
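These definitions are easy to see on a small discrete example. A Python sketch with a hypothetical joint PMF over two binary variables (the probabilities are invented for illustration):

```python
# Hypothetical joint PMF p_{X,Y}(x, y) over x, y in {0, 1}.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(x):
    """p_X(x) = sum over y of p_{X,Y}(x, y)."""
    return sum(p for (xi, yi), p in joint.items() if xi == x)

def conditional_x_given_y(x, y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)."""
    p_y = sum(p for (xi, yi), p in joint.items() if yi == y)
    return joint[(x, y)] / p_y

m0 = marginal_x(0)                  # 0.1 + 0.3 = 0.4
c01 = conditional_x_given_y(0, 1)   # 0.3 / (0.3 + 0.4)
```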
PDF and CDF (\(\rho=0\)):
The joint distribution of X and Y: \[\begin{equation} \begin{pmatrix} X \\ Y \end{pmatrix}\sim \mathcal N_{2}\left(\begin{pmatrix} \mu_{X} \\ \mu_{Y} \end{pmatrix}, \begin{pmatrix} \sigma _{X}^2 & \rho_{XY}\sigma _{X}\sigma _{Y}\\ \rho_{XY}\sigma _{X}\sigma _{Y} & \sigma _{Y}^2\\ \end{pmatrix} \right) \end{equation}\]
In the variance-covariance matrix, \(\rho_{XY}\) is the correlation between X and Y; their covariance is \(Cov(X,Y) = \rho_{XY}\sigma _{X}\sigma _{Y}\).
The volume under the joint density integrates to 1. \[\begin{equation} \iint_{S_{X,Y}} f_{X,Y}(x,y)\, dx\, dy = 1 \end{equation}\]
The marginal distributions:
\[\begin{equation} f_X(x) = \int_{S_Y} f_{X,Y}(x,y)\, dy \quad f_Y(y) = \int_{S_X} f_{X,Y}(x,y)\, dx \end{equation}\]
The probability of A given B: \[\begin{equation} P(A|B)=\dfrac{P(A,B)}{P(B)}=\frac{P(B|A)P(A)}{P(B)} \hbox{ where } P(B)>0 \end{equation}\]
Since the probability of A and B happening is the same as the probability of B and A happening, \(P(B,A)=P(A,B)\), we have:
\[\begin{equation} P(B,A)=P(B|A)P(A)=P(A|B)P(B)=P(A,B). \end{equation}\]Rearranging terms:
\[\begin{equation} P(B|A)=\frac{P(A|B)P(B)}{P(A)} \end{equation}\]This is Bayes’ rule.
You are an art dealer, and have been invited to purchase what may be a lost Leonardo da Vinci masterpiece. The seller has allowed you to view the painting under two conditions: you will have five minutes to examine it and you must decide on the spot whether to submit an offer or not. The price is $100,000,000. If the painting is genuine, it would be worth $300,000,000. If it is fake, it would be worth nothing. Before examining it, you assess that there is a 20% chance that it is genuine. When you examine it, you administer a test that always yields a positive result if the painting is genuine and yields a positive result 25% of the time if it is fake. The test yields a positive result.
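One way to work through the numbers with Bayes' rule, as a Python sketch:

```python
p_genuine = 0.20           # prior probability that the painting is genuine
p_pos_given_genuine = 1.0  # the test is always positive if genuine
p_pos_given_fake = 0.25    # false-positive rate if fake

# P(+) = P(+|G)P(G) + P(+|F)P(F): the marginal probability of a positive test.
p_pos = p_pos_given_genuine * p_genuine + p_pos_given_fake * (1 - p_genuine)

# Bayes' rule: P(G|+) = P(+|G)P(G) / P(+).
p_genuine_given_pos = p_pos_given_genuine * p_genuine / p_pos  # 0.2/0.4 = 0.5

# Expected value of the painting given the positive test:
expected_value = p_genuine_given_pos * 300_000_000  # $150M, above the $100M price
```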
If \(y\) is a vector of data points, we have:
\[\begin{equation} f(\theta\mid y) = \frac{f(y\mid \theta) f(\theta)}{f(y)} \end{equation}\]Here \(f(\cdot)\) is a PDF (continuous case) or a PMF (discrete case).
We can rewrite it as follows:
\[\begin{equation} Posterior = \frac{Likelihood \cdot Prior}{Marginal Likelihood} \end{equation}\]Prior \(f(\theta)\): the initial probability distribution of the parameter(s), before seeing the data.
Posterior \(f(\theta \mid y)\): the probability distribution of the parameters conditional on the data.
Likelihood \(f(y \mid \theta)\): the PMF (discrete case) or the PDF (continuous case) expressed as a function of parameters \(\theta\).
Marginal Likelihood \(f(y)\): It standardizes the posterior distribution to ensure that the area under the curve of the distribution sums to 1.
Marginal Likelihood \(f(y)\): it functions as the “normalising constant”.
\[\begin{equation} f(y)=\int f(y,\theta) d\theta = \int f(y \mid \theta)f(\theta)d\theta \end{equation}\]Without it, we have: \[\begin{equation} f(\theta\mid y) \propto f(y\mid \theta) f(\theta) \end{equation}\]
\[\begin{equation} \hbox{Posterior} \propto \hbox{Likelihood}\times \hbox{Prior} \end{equation}\]
\(P(Data \mid Theory)\)
\(P(Theory \mid Data)\)
1. Choosing a likelihood
The responses follow a binomial distribution:
\[\begin{equation} f(x\mid n,\theta) = {n \choose x}\theta^{x} (1-\theta)^{n-x} \end{equation}\]
2. Choosing a prior distribution
We need a distribution that can represent our uncertainty about the probability \(\theta\) of success.
The beta distribution is commonly used as prior for proportions.
\[\begin{equation*} f(\theta)= \left\{ \begin{array}{l l} \frac{1}{B(a,b)} \theta^{a - 1} (1-\theta)^{b-1} & \textrm{if } 0< \theta < 1\\ 0 & \textrm{otherwise}\\ \end{array} \right. \end{equation*}\]Where \(B(a,b)=\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta\), a normalizing constant that ensures that the area under the curve sums to 1.
The expectation and variance of Beta PDF:
\[\begin{equation} E[\theta]=\frac{a}{a+b} \end{equation}\] \[\begin{equation} Var(\theta)=\frac{ab}{\left(a+b\right)^{2}\left(a+b+1\right)} \end{equation}\]
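These two formulas translate directly into code. A Python sketch, evaluated at the illustrative prior \(Beta(4, 2)\):

```python
def beta_mean(a, b):
    """E[theta] = a / (a + b) for a Beta(a, b) distribution."""
    return a / (a + b)

def beta_var(a, b):
    """Var(theta) = ab / ((a + b)^2 (a + b + 1))."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# For Beta(4, 2): mean 4/6, variance 8/(36 * 7).
m = beta_mean(4, 2)
v = beta_var(4, 2)
```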
3. Conjugate analysis based on Bayes’ rule
To get the posterior distribution for \(\theta\), we have (ignore the normalising constant \({n \choose x}\)):
\[\begin{equation} f(\theta\mid x) \propto f(x\mid n, \theta) f(\theta) \end{equation}\]We multiply the likelihood and the prior: \[\begin{equation} \theta^{x} (1-\theta)^{n-x}\theta^{a-1}(1-\theta)^{b-1}=\theta^{a+x-1}(1-\theta)^{b+n-x-1} \end{equation}\]
We computed the posterior up to proportionality.
Thus, given a \(Binomial(x \mid n, \theta)\) likelihood and a \(Beta(a,b)\) prior on \(\theta\), the posterior will be \(Beta(a+x,\, b+n-x)\).
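Because the update only adds the data counts to the prior parameters, it is one line of code. A Python sketch with illustrative numbers (a \(Beta(4,2)\) prior and \(x=7\) successes in \(n=10\) trials):

```python
def beta_binomial_update(a, b, x, n):
    """Beta(a, b) prior + x successes in n Binomial trials
    -> Beta(a + x, b + n - x) posterior."""
    return a + x, b + n - x

a_post, b_post = beta_binomial_update(4, 2, 7, 10)  # -> Beta(11, 5)

# The posterior mean lies between the prior mean 4/6 and the MLE 7/10.
post_mean = a_post / (a_post + b_post)
```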
The conjugate case of the beta-binomial:
In realistic data-analysis settings, the likelihood function will be very complex, and many parameters will be involved.
MCMC (Markov Chain Monte Carlo) sampling
Bayes’ rule: \[\begin{equation} \begin{aligned} p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{p(Y)}\\ p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{\int_{\Theta} p(Y|\Theta) \cdot p(\Theta) d\Theta } \end{aligned} \end{equation}\]
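The integral in the denominator is typically intractable, which is why MCMC samplers work with the unnormalized posterior only. A minimal random-walk Metropolis sketch in Python (brms/Stan actually use Hamiltonian Monte Carlo; this is just the simplest member of the MCMC family, targeting the \(Beta(11,5)\) posterior from the beta-binomial example):

```python
import random, math

def metropolis(log_post, n_samples=5000, init=0.5, step=0.1, seed=4):
    """Random-walk Metropolis: samples from a density known only up to a
    constant, so the intractable normalizing integral is never needed."""
    rng = random.Random(seed)
    theta, chain = init, []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0, step)
        # Accept with probability min(1, post(prop) / post(theta)).
        if math.log(rng.random()) < log_post(prop) - log_post(theta):
            theta = prop
        chain.append(theta)
    return chain

def log_post(t):
    """Unnormalized log density of Beta(11, 5): 10*log(t) + 4*log(1-t)."""
    return 10 * math.log(t) + 4 * math.log(1 - t) if 0 < t < 1 else -math.inf

chain = metropolis(log_post)
post_mean = sum(chain[1000:]) / len(chain[1000:])  # drop warm-up draws
```

The chain mean should be close to the analytical posterior mean \(11/16 = 0.6875\).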
It works when we know the closed form of the CDF of the distribution we want to simulate from and can derive the inverse of that CDF.
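For example, the Exponential(\(\lambda\)) CDF \(F(x)=1-e^{-\lambda x}\) inverts to \(F^{-1}(u)=-\log(1-u)/\lambda\), so applying \(F^{-1}\) to uniform draws yields exponential draws. A Python sketch:

```python
import random, math

def exponential_sample(lam, rng):
    """Inverse-transform sampling: apply the inverse CDF to u ~ Uniform(0, 1).
    For Exponential(lam), F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -log(1-u)/lam."""
    u = rng.random()
    return -math.log(1 - u) / lam

rng = random.Random(0)
draws = [exponential_sample(2.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)  # should be near 1/lam = 0.5
```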
In the trace plot, we see a characteristic “fat hairy caterpillar” shape, indicating that the sampler has converged.
The first few samples (warm-ups) were dropped, because the initial samples may not yet be sampling from the posterior.
We usually run four parallel chains with different initial values chosen randomly. They all end up sampling from the same distribution and overlap visually (mixing).
Data:
Suppose we have data from a single subject repeatedly pressing the space bar as fast as possible, without paying attention to any stimuli. The data are response times in milliseconds in each trial.
RQ: How long does it take this subject to press a key?
Assumptions: \[\begin{equation} rt_n \sim \mathit{Normal}(\mu, \sigma) \end{equation}\]
\[\begin{equation} rt_n = \mu + \varepsilon_n \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) \end{equation}\]
library(brms)
fit_press <- brm(rt ~ 1,
  data = df_spacebar,
  family = gaussian(),
  prior = c(
    prior(uniform(0, 60000), class = Intercept, lb = 0, ub = 60000),
    prior(uniform(0, 2000), class = sigma, lb = 0, ub = 2000)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000
)
SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000123 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.23 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.047495 seconds (Warm-up)
Chain 1: 0.047387 seconds (Sampling)
Chain 1: 0.094882 seconds (Total)
Chain 1:
[Chains 2, 3, and 4 produce analogous output.]
Sampling Specifications:
chains: the number of independent runs for sampling (default: 4)
iter: the number of iterations that the sampler makes to sample from the posterior distribution of each parameter (default: 2000)
warmup: the number of iterations from the start of sampling that are eventually discarded (default: half of iter)
# A draws_df: 3 iterations, 1 chains, and 4 variables
b_Intercept sigma lprior lp__
1 169 25 -19 -1683
2 169 25 -19 -1683
3 168 25 -19 -1683
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Posterior mean of b_Intercept:
[1] 168.6085
95% credible interval:
    2.5%    97.5%
166.0246 171.0875
Posterior mean of sigma:
[1] 24.98079
95% credible interval:
    2.5%    97.5%
23.17134 26.83562
This sensitivity analysis showed that the posterior is not overly affected by the choice of prior.
This talk is dedicated to SMLP 2022. All the materials are from https://vasishth.github.io/.